| cohort_group | cohort_c | n_samples |
|---|---|---|
| Green I | 1 | 40 |
| Pooled | 0 | 15 |
| Red | 6 | 40 |
Urine Metabolomics Pancreatitis QC/QA
Purpose
Quality control / quality analysis (QC/QA) report on the urine metabolomics data from pancreatitis samples. An Executive Summary appears at the end of this report.
This report should have been supplied as two versions:
- A Word document.
- A self-contained HTML file, possibly with some interactive graphics.
Data
The data consist of metabolites measured in urine and various sample metadata.
Methods
We analyzed all the data using R v 4.3.0 (R Core Team 2021). Data were read in using readxl or readr, depending on the source. Extra metadata were joined to the supplied basic metadata using the processing_id. Samples were normalized by osmolarity alone, by median abundance, or by a combination of osmolarity and median abundance.
Sample-sample correlations were calculated using information-content-informed Kendall-tau (ICI-Kt), a modification of Kendall-tau correlation that allows the inclusion of missing values (Robert M. Flight, Bhatt, and Moseley 2022). Almost all of the analysis below uses the intensities directly as provided, so the correlations here should correspond almost 1:1 to a normal Kendall-tau.
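To make the idea of a Kendall-tau that tolerates missing values concrete, here is a minimal sketch in Python (the report itself used R). It is not the published ICI-Kt implementation. It assumes that a missing value (`None`) means "below detection" and therefore ranks lower than every observed value, and that features missing in both samples are dropped. The function names `kendall_tau_a` and `tau_with_missing` are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Plain Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def tau_with_missing(x, y):
    """Kendall tau-a where None is treated as lower than any observed value.

    Assumption (not taken from the report): features missing in both
    samples carry no information and are dropped before counting pairs.
    """
    observed = [v for v in list(x) + list(y) if v is not None]
    floor_value = min(observed) - 1.0  # stand-in for "below detection"
    kept = [(a, b) for a, b in zip(x, y) if not (a is None and b is None)]
    x_filled = [floor_value if a is None else a for a, _ in kept]
    y_filled = [floor_value if b is None else b for _, b in kept]
    return kendall_tau_a(x_filled, y_filled)
```

With no missing values, `tau_with_missing` returns exactly what `kendall_tau_a` returns, which mirrors the statement above that the correlations should closely match a normal Kendall-tau when the data are nearly complete.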
Associations between sample metadata and sample principal component scores were tested using ANOVA, as implemented in the visualizationQualityControl package v 0.4.10 (Robert M. Flight and Moseley 2021).
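As an illustration of what such a test computes, the following Python sketch calculates the one-way ANOVA F statistic for one principal component's scores against one categorical metadata variable. It is a textbook calculation under that assumption, not the code of the visualizationQualityControl package, and `one_way_anova_f` is a hypothetical name. Converting F to a p-value needs the F distribution and is left out.

```python
from statistics import mean


def one_way_anova_f(scores, groups):
    """F statistic for PC scores split by a categorical metadata variable."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    grand_mean = mean(scores)
    n_groups = len(by_group)
    n_samples = len(scores)
    # Variation of group means around the grand mean
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in by_group.values()
    )
    # Variation of samples around their own group mean
    ss_within = sum(
        (score - mean(values)) ** 2
        for values in by_group.values()
        for score in values
    )
    return (ss_between / (n_groups - 1)) / (ss_within / (n_samples - n_groups))
```

A large F means the metadata variable explains much of the spread along that component.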
We determined the limit of detection (LOD) across all metabolites from the mean and standard deviation of each metabolite in the pooled samples. Using bins of width 0.05 across the mean, we calculated the standard deviation of the standard deviation (SDoSD) of the mean values within each bin, and examined the plot to determine where the SDoSD begins to increase.
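The binning step can be sketched as follows in Python (the report used R). The description above leaves some room for interpretation, so this sketch assumes one reading: each metabolite contributes its pooled-sample mean and SD, metabolites are binned by mean in steps of 0.05, and the SDoSD of a bin is the standard deviation of the SDs that fall into it. The function name `sd_of_sd_by_bin` is hypothetical.

```python
from collections import defaultdict
from math import floor
from statistics import stdev


def sd_of_sd_by_bin(means, sds, bin_width=0.05):
    """Return (bin_left_edge, SDoSD) pairs, ordered by increasing mean.

    means, sds: per-metabolite mean and SD across the pooled samples.
    Bins holding fewer than two metabolites are skipped, because a
    standard deviation is undefined for a single value.
    """
    bins = defaultdict(list)
    for m, s in zip(means, sds):
        bins[floor(m / bin_width)].append(s)
    result = []
    for index in sorted(bins):
        if len(bins[index]) >= 2:
            result.append((index * bin_width, stdev(bins[index])))
    return result
```

Plotting the second element against the first gives the curve that is inspected for the point where SDoSD starts to rise.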
All Data
To start, we examine correlation and principal component analysis grouping using all of the data, including the pooled quality-control samples provided by the metabolomics core.
Let’s first double check how many samples we have in each disease group. The counts are shown in Table 1. Tables 2 and 3 show the breakdown by race and gender, respectively.
Table 1. Number of samples in each disease group.
Table 2. Number of samples by race.
| race_pt | n_samples |
|---|---|
| Asian | 3 |
| Black or African American | 14 |
| Do not know | 1 |
| More than one race | 1 |
| White | 61 |
| NA | 15 |
Table 3. Number of samples by gender.
| gender | n_samples |
|---|---|
| F | 45 |
| M | 35 |
| NA | 15 |
Need for Normalization
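We can check whether samples need more than osmolarity normalization by examining boxplots of the metabolite intensity distribution before and after normalization. These are shown in Figure 1. As shown, we don’t think that osmolarity alone is enough to normalize these samples. Interestingly, median and osmolarity+median normalization give the same intensities. This is expected if median normalization rescales each sample by its own median: osmolarity normalization applies a single constant factor to every metabolite in a sample, and that factor cancels when the sample is then rescaled by its median.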
Figure 1. Metabolite log(intensity) boxplots for no normalization (none), median normalization (median), osmolarity normalization, or both median and osmolarity normalization (osmo-med), with samples colored by disease group (cohort_c).
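A small numeric check shows why the median and osmo-med results coincide. This Python sketch assumes that osmolarity normalization divides every intensity in a sample by that sample's osmolarity, and that median normalization divides by the sample's median intensity. The report does not spell out the exact formulas, and the intensity and osmolarity values below are made up.

```python
from statistics import median


def osmolarity_normalize(intensities, osmolarity):
    """Divide every intensity in one sample by that sample's osmolarity."""
    return [v / osmolarity for v in intensities]


def median_normalize(intensities):
    """Divide every intensity in one sample by the sample's median."""
    med = median(intensities)
    return [v / med for v in intensities]


# Hypothetical intensities for a single sample, and a hypothetical osmolarity
sample = [120.0, 300.0, 45.0, 980.0, 210.0]
osmolarity = 650.0

median_only = median_normalize(sample)
osmo_then_median = median_normalize(osmolarity_normalize(sample, osmolarity))

# The per-sample osmolarity factor cancels, so both routes agree
max_difference = max(abs(a - b) for a, b in zip(median_only, osmo_then_median))
```

Under these assumptions `max_difference` is zero up to floating-point rounding, because dividing by the osmolarity also divides the median by the same factor.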
ICI-Kt
For all sample-sample pairs, we calculate the ICI-Kt correlation (see Methods). These correlations are shown as a heatmap in Figure 2. To read this heatmap, note that the correlation values are encoded as a color, and each square is the correlation of one sample with another sample. So row 1, column 2 (and column 1, row 2) represents the correlation of pooled sample 1 to pooled sample 2. The disease group of each sample is encoded by the colors along the rows and columns of the heatmap. The ordering of the samples in the heatmap is decided by converting the correlation to a dissimilarity (1 - correlation), and then clustering the samples using hierarchical clustering. Therefore, ideally we would see groups of highly correlated samples that also group by their disease status.
Figure 2. ICI-Kt correlation heatmap of all samples, ordered by hierarchical clustering on dissimilarity (1 - correlation) and colored by disease group.
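The ordering step can be sketched in a few lines of Python (the report used R). The report does not state which linkage was used, so average linkage here is an assumption, and `cluster_order` is a hypothetical name. The function turns a correlation matrix into 1 - correlation distances, repeatedly merges the two closest clusters, and returns the resulting order of the samples.

```python
def cluster_order(corr):
    """Sample order from average-linkage clustering on 1 - correlation.

    corr: square, symmetric list-of-lists correlation matrix.
    """
    n = len(corr)
    dist = [[1.0 - corr[i][j] for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # each cluster keeps its leaf order
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between all members of the two clusters
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]


# Toy matrix: samples 0 and 2 are similar, as are samples 1 and 3
toy_corr = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.8, 0.1, 1.0],
]
toy_order = cluster_order(toy_corr)
```

In the toy example the similar samples end up next to each other in `toy_order`, which is what places blocks of high correlation along the diagonal of the heatmap.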
This, we admit, is not what we hoped to see. There are no large groups of high correlations that also correspond to disease group. In fact, there doesn’t seem to be anything with really high correlation outside of the pooled replicates. The highest correlations should be just off the diagonal, i.e. the correlation with the nearest neighbor, except where there are bigger differences. Let’s see what that distribution looks like, in Figure 3.
Figure 3. Histogram of the direct neighbor ICI-Kt correlations (just off the diagonal) from Figure 2.
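Extracting the "just off the diagonal" values amounts to reading the correlation between each sample and the next one in the heatmap order. The Python sketch below shows this on a made-up 4 x 4 matrix with an assumed ordering. `neighbor_correlations` is a hypothetical name and the numbers are not from the report.

```python
from statistics import mean


def neighbor_correlations(corr, order):
    """Correlation of each sample with the next sample in the given order."""
    return [corr[order[k]][order[k + 1]] for k in range(len(order) - 1)]


# Made-up correlation matrix and heatmap ordering, for illustration only
example_corr = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.8, 0.1, 1.0],
]
example_order = [0, 2, 1, 3]

neighbors = neighbor_correlations(example_corr, example_order)
neighbor_mean = mean(neighbors)
```

A histogram of `neighbors` is the kind of distribution shown in Figure 3, and `neighbor_mean` corresponds to the summary value quoted for it.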
The mean value is 0.62, which isn’t stupendous, but isn’t too bad either. The real problem is that green and red samples have comparatively high correlations with each other. For example, the ICI-Kt correlation of “lusczek_138” (red) with “lusczek_030” (green) is 0.596.
We can confirm what this really looks like by plotting their actual intensities against each other too, shown in Figure 4.
Figure 4. Plot of the log-intensities of lusczek_138 (red sample) vs lusczek_030 (green sample).
PCA
We double check all of the above using principal components analysis (PCA), which operates slightly differently than the ICI-Kt correlation. Figure 5 shows the first two principal components on the osmolarity+median normalized data, with samples colored by disease group.
Figure 5. PCA plot of samples colored by their disease group.
Green & Red Only
Looking at Figure 5, we can see that all of the pooled samples cluster right in the middle of the plot, and in fact are just off the (0, 0) point of PC1 and PC2. It is possible that they are influencing where the other samples fall. Therefore, we restrict the analysis to just green I and red samples (cohort_c values of 1 and 6, control and chronic pancreatitis, respectively).
ICI-Kt
Figure 6. ICI-Kt heatmap of green I and red samples only, with samples arranged by their sample-sample similarity.
When samples are grouped by cohort, there is nothing obviously wrong in the sample-sample correlations, as shown in Figure 7.
Figure 7. ICI-Kt heatmap of green I and red samples only, with samples arranged by their sample-sample similarity within each cohort.
Outlier Samples
We can also use the median correlations and feature intensity distributions within each cohort of samples to check whether any samples are outliers that should be removed prior to differential analysis. Figure 8 shows that there is a single sample in cohort 1 (green I) that should be removed.
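The median-correlation part of this check can be sketched in Python as follows (the report used R). The report does not state the cutoff used, so the boxplot-style rule below (flag values lower than Q1 - 1.5 x IQR) is an assumption, and the sketch ignores the feature intensity distributions entirely. The function names are hypothetical.

```python
from statistics import median, quantiles


def median_correlations(corr, cohort):
    """Median correlation of each sample to the other samples in its cohort.

    corr: square correlation matrix; cohort: indices of samples in one cohort.
    """
    result = {}
    for i in cohort:
        others = [corr[i][j] for j in cohort if j != i]
        result[i] = median(others)
    return result


def flag_low_outliers(values, k=1.5):
    """Assumed rule: flag samples whose value is below Q1 - k * IQR."""
    q1, _, q3 = quantiles(list(values.values()), n=4)
    cutoff = q1 - k * (q3 - q1)
    return [name for name, v in values.items() if v < cutoff]
```

Applied per cohort, `flag_low_outliers(median_correlations(corr, cohort))` returns the samples that correlate unusually poorly with the rest of their group.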